CMSC320 - Final Project

Polynomial Fitting - Daily Temperature of Major Cities

(https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities)

Step 1 - Loading and preprocessing the city temperature dataset

The analysis proceeds as follows:

We can now explore the data for a specific country, for instance Israel.

Relation between daily temperature and 'DayOfYear':

Let us subset the dataset to contain samples only from Israel, so that we can investigate how the average daily temperature (the Temp column) changes as a function of DayOfYear.
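The subsetting step can be sketched as follows. This is only an illustration on a tiny hand-made table: the column names Country, Date and Temp, and the way DayOfYear is derived from a date, are assumptions about the preprocessed dataset rather than something taken from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; the column names
# (Country, Date, Temp) are assumptions about the preprocessed file.
df = pd.DataFrame({
    "Country": ["Israel", "Israel", "Jordan", "Israel"],
    "Date": pd.to_datetime(
        ["2001-01-01", "2001-07-19", "2001-07-19", "2002-12-31"]
    ),
    "Temp": [12.5, 29.0, 27.5, 14.0],
})

# Derive the day-of-year feature (1..366) from the date.
df["DayOfYear"] = df["Date"].dt.dayofyear

# Keep only the samples from Israel.
israel = df[df["Country"] == "Israel"].copy()
print(israel[["DayOfYear", "Temp"]])
```

A scatter plot of Temp against DayOfYear, colored by year, would then show the yearly wave described in the text.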

Based on this plot, one can note that the data behaves quite similarly across different years and has the shape of a wave, with the highest temperatures around day ~200 of the year.

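Since the curve has roughly three extreme points, we can assume that a polynomial of degree 3 or 4 might be suitable for this data. Note that a polynomial of degree k has at most k-1 turning points, so degree 4 is the minimum if all three extremes are to be captured as turning points.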
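To get a feeling for the difference between degree 3 and degree 4, here is a minimal sketch on a synthetic seasonal curve. The cosine curve peaking at day 200 is an assumption made for illustration only; it is not the real temperature data.

```python
import numpy as np

# Synthetic seasonal curve (an assumption, not the real data):
# warmest around day 200, coldest near the ends of the year.
days = np.arange(1, 366)
temp = 20 + 8 * np.cos(2 * np.pi * (days - 200) / 365)

# Rescale the days to [0, 1] to keep the least-squares problem well conditioned.
x = days / 365.0

mse = {}
for k in (3, 4):
    coeffs = np.polyfit(x, temp, deg=k)  # highest power first
    mse[k] = float(np.mean((np.polyval(coeffs, x) - temp) ** 2))
    print(f"degree {k}: training MSE = {mse[k]:.4f}")
```

Because the degree-3 polynomials are a subset of the degree-4 polynomials, the training error of the degree-4 fit can never be larger. Whether the extra degree also helps on unseen data is a separate question that needs a held-out test set.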

The standard deviation of the daily temperatures for each month:

Now we will group the samples by Month and create a bar plot showing, for each month, the standard deviation of the daily temperatures.
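A minimal sketch of the grouping step, assuming a table with Month and Temp columns. The data below is randomly generated with two months of different spread; it only illustrates the pandas calls and does not reproduce the real values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: April (4) is noisy, July (7) is stable.
months = np.repeat([4, 7], 200)
centre = np.where(months == 4, 20.0, 28.0)
spread = np.where(months == 4, 4.0, 1.0)
temps = centre + spread * rng.normal(size=months.size)
df = pd.DataFrame({"Month": months, "Temp": temps})

# Standard deviation of the daily temperature for each month.
monthly_std = df.groupby("Month")["Temp"].std()
print(monthly_std)

# With matplotlib installed, monthly_std.plot.bar() would draw the bar plot.
```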

Suppose we fit a polynomial model (with the correct degree) over data sampled uniformly at random from this dataset, and then use it to predict temperatures from random days across the year.

Based on this graph, I would expect that this model won't succeed equally in prediction across all months. In months with low variance (June [6] to September [9]), I would expect the model to perform better and to fit closer to reality. I assume that it will do worst in March and April (months 3 and 4), which are months with high variability.

This is under the assumption that the test set is generated from the same distribution as the training set.
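The expectation above can be checked on simulated data. The sketch below assumes a cosine-shaped yearly curve with noise that is four times larger in March and April, a 75/25 train/test split and a degree-4 polynomial; all of these are illustrative choices, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
day = rng.integers(1, 366, size=n)  # day of year, 1..365

# Map day-of-year to a calendar month using a non-leap reference year.
dates = np.datetime64("2001-01-01") + (day - 1).astype("timedelta64[D]")
month = dates.astype("datetime64[M]").astype(int) % 12 + 1

# Assumed noise model: months 3 and 4 are much more variable.
noise_std = np.where(np.isin(month, [3, 4]), 4.0, 1.0)
temp = 20 + 8 * np.cos(2 * np.pi * (day - 200) / 365) + noise_std * rng.normal(size=n)

# Train and test come from the same distribution (uniform random split).
perm = rng.permutation(n)
train, test = perm[:3000], perm[3000:]

coeffs = np.polyfit(day[train] / 365.0, temp[train], deg=4)
sq_err = (np.polyval(coeffs, day[test] / 365.0) - temp[test]) ** 2

# Mean squared test error for each month.
test_mse = {m: float(sq_err[month[test] == m].mean()) for m in range(1, 13)}
for m, err in test_mse.items():
    print(m, round(err, 2))
```

In this simulation the months with the larger noise also show the larger test error, since no model can predict the irreducible noise.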

Step 3 - Explore differences between countries

And now, back to the full dataset: we will group the samples according to Country and Month, and calculate the average and standard deviation of the temperature.

We will then draw a line plot of the average monthly temperature, with error bars, color-coded by country.
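The aggregation behind this plot might look like the sketch below. The small table is made up for illustration, and the column names Country, Month and Temp are assumed to match the preprocessed dataset.

```python
import pandas as pd

# Hypothetical mini-dataset with two countries and two months.
df = pd.DataFrame({
    "Country": ["Israel"] * 4 + ["South Africa"] * 4,
    "Month": [1, 1, 7, 7, 1, 1, 7, 7],
    "Temp": [11.0, 13.0, 28.0, 30.0, 23.0, 25.0, 12.0, 14.0],
})

# Mean and (sample) standard deviation per country and month.
stats = (
    df.groupby(["Country", "Month"])["Temp"]
    .agg(["mean", "std"])
    .reset_index()
)
print(stats)

# For the plot, one line per country could be drawn with matplotlib's
# plt.errorbar(x=Month, y=mean, yerr=std), looping over stats.groupby("Country").
```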

Based on the graph above, one can note that not all countries share the same pattern: the distribution of average monthly temperature as a function of the month differs between them.

According to this plot, we expect that a model fitted on the Israel data only will perform very well on Jordan, whereas it likely won't work on South Africa or on the Netherlands. This is because South Africa's trends are opposite to those of the other three countries (e.g. months 6-9 are relatively hot in Israel, whereas this is the cold period in South Africa), while the average temperatures of the Netherlands are quite far from those of Israel. The distribution for the Netherlands is similar in shape to that of Israel, but about 9 degrees lower at any time of the year. Thus, the model fitted for Israel could be used for the Netherlands by simply adjusting the value of the intercept.
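The intercept adjustment can be illustrated with a small sketch. It assumes an idealized case in which the second country's curve is exactly the first one shifted down by 9 degrees; the curves are synthetic and the degree 4 is an arbitrary choice for the example.

```python
import numpy as np

days = np.arange(1, 366)
x = days / 365.0

# Synthetic curves (assumption): same shape, constant 9-degree gap.
israel_temp = 20 + 8 * np.cos(2 * np.pi * (days - 200) / 365)
netherlands_temp = israel_temp - 9.0

# Fit on the "Israel" curve only.
coeffs = np.polyfit(x, israel_temp, deg=4)
pred = np.polyval(coeffs, x)
mse_raw = float(np.mean((pred - netherlands_temp) ** 2))

# Estimate the constant gap and fold it into the intercept,
# which is the last coefficient returned by np.polyfit.
offset = float(np.mean(netherlands_temp - pred))
shifted = coeffs.copy()
shifted[-1] += offset
mse_shifted = float(np.mean((np.polyval(shifted, x) - netherlands_temp) ** 2))

print(round(offset, 3), round(mse_raw, 2), round(mse_shifted, 4))
```

Estimating the offset requires at least some observations from the target country, so this is a light form of refitting rather than a pure transfer of the model.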

Step 4 - Fitting model for different values of the degree hyperparameter

Over the subset containing observations only from Israel we will do the following:

Then we will create a bar plot showing the test error recorded for each value of k. This is in order to find which value of k best fits the data.
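A minimal sketch of such a degree search is given below. The 75/25 train/test split, the range k = 1..10, the mean squared error as loss, and the synthetic data are all assumptions made for the example; the original steps may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic stand-in for the Israel subset (assumption).
day = rng.integers(1, 366, size=n)
temp = 20 + 8 * np.cos(2 * np.pi * (day - 200) / 365) + rng.normal(0, 2, size=n)
x = day / 365.0  # rescale for numerical stability

# Random 75/25 split into train and test sets.
perm = rng.permutation(n)
n_train = int(0.75 * n)
train, test = perm[:n_train], perm[n_train:]

# Fit a polynomial of each degree k on the train set, score it on the test set.
test_error = {}
for k in range(1, 11):
    coeffs = np.polyfit(x[train], temp[train], deg=k)
    resid = np.polyval(coeffs, x[test]) - temp[test]
    test_error[k] = round(float(np.mean(resid ** 2)), 2)
    print(k, test_error[k])

# Smallest test error; ties go to the lowest degree.
best_k = min(test_error, key=test_error.get)
print("best k:", best_k)
```

The values in test_error are what the bar plot would display, one bar per k.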

Based on this, I would choose k=5 as the value that best fits and describes the data: it gives the lowest test error, and above this value the model appears to overfit.

Step 5 - Evaluating fitted model on different countries

Now we will fit a model over the entire subset of records from Israel using the degree of k=5 chosen above.

We then create a bar plot showing the model’s error over each of the other countries.
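The evaluation step can be sketched as follows. The country profiles below are invented (mean, amplitude and peak day are made-up numbers) and are not meant to reproduce the errors reported in this project; only the degree k=5 comes from the text.

```python
import numpy as np
import pandas as pd

days = np.arange(1, 366)

def seasonal(mean, amplitude, peak_day):
    """Hypothetical yearly temperature curve peaking at peak_day."""
    return mean + amplitude * np.cos(2 * np.pi * (days - peak_day) / 365)

# Made-up profiles, for illustration only.
profiles = {
    "Israel": seasonal(20.0, 8.0, 200),
    "Jordan": seasonal(19.0, 9.0, 200),
    "The Netherlands": seasonal(11.0, 8.0, 200),
    "South Africa": seasonal(18.0, 3.5, 20),
}
df = pd.concat(
    [pd.DataFrame({"Country": name, "DayOfYear": days, "Temp": temps})
     for name, temps in profiles.items()],
    ignore_index=True,
)

# Fit a degree-5 polynomial on all records from Israel.
israel = df[df["Country"] == "Israel"]
coeffs = np.polyfit(israel["DayOfYear"].to_numpy() / 365.0,
                    israel["Temp"].to_numpy(), deg=5)

def mse_for(group):
    """Mean squared error of the Israel model on one country's records."""
    pred = np.polyval(coeffs, group["DayOfYear"].to_numpy() / 365.0)
    return float(np.mean((pred - group["Temp"].to_numpy()) ** 2))

others = df[df["Country"] != "Israel"]
errors = pd.Series({name: mse_for(g) for name, g in others.groupby("Country")})
print(errors.round(2))
```

With matplotlib installed, errors.plot.bar() would give one bar per country.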

As we expected, the model fitted over the subset of observations from Israel performed best on Jordan, and in general it performs worse on data from the other countries. As we have seen in figure 3, the distribution of temperatures in Jordan resembles that of Israel. Therefore, out of the three countries, the model performed best on Jordan.

The distributions of South Africa and Netherlands were further from those of Israel and therefore the fitted model performed poorly over them.

Although the distribution of the temperature data from the Netherlands has a very similar shape to that of Israel, and the distribution of the observations from South Africa is very different, the model performed better over South Africa. This is probably because, on average, the observations from Israel are closer to those of South Africa. Hence, although the model does not correctly mimic the distribution of observations from South Africa, the errors are still smaller than for the observations from the Netherlands.